
Conversation


@sarakodeiri sarakodeiri commented Sep 16, 2025

Short Description

Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)

Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR also includes preprocessing methods for building the training features (e.g., Gower distance, the DOMIAS method).
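The "saved as pickle files" step can be sketched as below (a minimal sketch; `save_meta_classifier` and the output layout are illustrative, not the toolkit's actual API):

```python
import pickle
from pathlib import Path


def save_meta_classifier(model, out_dir: str, name: str) -> Path:
    """Persist a fitted meta classifier as a pickle file.

    Sketch only: the toolkit's real save path and naming scheme may differ.
    """
    out_path = Path(out_dir) / f"{name}.pkl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        pickle.dump(model, f)
    return out_path
```

Either meta classifier choice (e.g. an XGBoost or logistic-regression pipeline) could be passed in as `model` once fitted.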

Things to note:

  • The pipeline has been tested with the data provided by the original repository, which is very small. More thorough tests are needed once all parts of the pipeline are finalized.
  • The original repository has an "iteration" count but doesn't use it properly. The same behavior has been mirrored here.
  • This implementation also follows the original repository's use of DOMIAS, but it might be better to apply DOMIAS only to non-categorical data; that should likewise be tested once the pipeline is finalized.
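The TPR@FPR metric mentioned above can be sketched with NumPy alone (a minimal sketch; the toolkit's own `get_tpr_at_fpr` helper in `train_utils` may be implemented differently):

```python
import numpy as np


def tpr_at_fpr(labels: np.ndarray, scores: np.ndarray, fpr_target: float = 0.1) -> float:
    """Return the highest TPR achievable while keeping FPR <= fpr_target.

    `labels` are binary membership labels; `scores` are attack scores
    where higher means "more likely a member".
    """
    order = np.argsort(-scores)        # sort by descending score
    ordered = labels[order]
    tps = np.cumsum(ordered)           # true positives at each threshold
    fps = np.cumsum(1 - ordered)       # false positives at each threshold
    tpr = tps / max(ordered.sum(), 1)
    fpr = fps / max((1 - ordered).sum(), 1)
    valid = fpr <= fpr_target
    return float(tpr[valid].max()) if valid.any() else 0.0
```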

Tests Added

Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.

Collaborator

@fatemetkl fatemetkl left a comment

Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.

I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.

trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
Collaborator

Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I'm suggesting this because we already have several data config files, like trans_domain.json and info.json, with similar metadata. We could keep config.yaml for attack-pipeline configuration only, set the path to the metadata JSON file here (similar to trans_domain_file_path), and load it in the BlendingPlusPlus init.
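One hypothetical shape for that split (the `real_metadata_file_path` key and `real_metadata.json` file name are illustrative, not from the repo):

```yaml
# config.yaml keeps only attack-pipeline settings and points at the metadata file.
trans_json_file_path: ${base_example_dir}/data_configs/trans.json
trans_domain_file_path: ${base_example_dir}/data_configs/trans_domain.json
real_metadata_file_path: ${base_example_dir}/data_configs/real_metadata.json  # hypothetical
```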

log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
Collaborator

We can probably just have another if here instead of elif since someone might want to run data processing as well as meta classifier training.
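The suggested change can be sketched with stub stage flags (names assumed from the snippet above; the real functions take the full config):

```python
from dataclasses import dataclass


@dataclass
class PipelineFlags:
    run_data_processing: bool
    run_metaclassifier_training: bool


def run_pipeline(flags: PipelineFlags) -> list[str]:
    """Independent `if` blocks: both stages can run in a single invocation."""
    executed = []
    if flags.run_data_processing:
        executed.append("data_processing")
    if flags.run_metaclassifier_training:  # `if`, not `elif`
        executed.append("metaclassifier_training")
    return executed
```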

Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
Collaborator

Is this the synthetic data generated from the TabDDPM model trained on the df_real_train data saved at Path(processed_attack_data_path, "real_train.csv")? Since we still haven't added the code for that, a comment saying where this synth data comes from would be helpful.

Collaborator Author

For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!


# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
"examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
Collaborator

I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from config.yaml instead of fixing it here?

Collaborator Author

I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!

Collaborator

Got it! Makes sense 👍


study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
Collaborator

I see this is the random seed used in the original code, but as David suggested, we can make it an option on the class and pass the user-specified random_seed here (it could be the one the user sets in config.yaml).

colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
Collaborator

Out of curiosity, was there an issue with tree_method="auto" if not use_gpu else "gpu_hist", ?

Collaborator Author

Nope. I wanted it to run on my local without taking up GPU space. Should've just changed use_gpu to false in config.yaml :)

.gitignore Outdated
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
Collaborator

I think we don't have this last one anymore

"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
Collaborator

I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.

Collaborator Author

In the original codebase, bounds are used when they're training the XGBoost model to preprocess categorical columns, while the meta classifier features don't include them at all. I added them just in case, but didn't use them.

As for "col_type", you're right and they haven't been used yet. Not sure if they ever will.

I'll remove both for now and add them as needed.

Collaborator

@lotif lotif left a comment

Great stuff, thanks for addressing the comments! Approving with just some minor ones now.

As others have commented in this PR as well, please wait for their approval or make sure to address all their comments.

"clavaddpm_black_box",
"clavaddpm_white_box",
]
]
Collaborator

nitpick: no need to add this space here, it messes up the indentation.

process_split_data(
all_population_data=population_data,
processed_attack_data_path=Path(config.data_paths.processed_attack_data_path),
# TODO: column_to_stratify value is not documented in the original codebase.
Collaborator

I believe this is not true anymore? I see docstrings in that function right now.

Collaborator Author

I think by "original codebase", @fatemetkl meant the submission repository (link), but I'm not sure what the "TODO" is for. My guess is we test things with stratified columns specified.

Collaborator

Hi,
The "original codebase" in this comment refers to the attack submission. I should have added a link. So sorry for the confusion! I will fix it in my PR.

As Sara mentioned, since this parameter wasn’t documented in the original attack codebase, I added a TODO to experiment with other columns in case the one I specified isn’t the correct one.

Collaborator

@emersodb emersodb left a comment

Thank you for all the edits and work towards addressing the comments. Most of my previous comments have been addressed really nicely. I have a few additional small ones, and I left a few of my comments unresolved, as I think a few small changes remain before they are fully addressed. Let me know if you have questions on any comments!

@@ -1,70 +1,96 @@
# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
"""Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo."""
Collaborator

Rather than just referring to the "submission repo" I would add a link to said repo here if possible.

lr_model = LogisticRegression(max_iter=1000)
self.meta_classifier_ = lr_model.fit(meta_features, y_train)

else:
Collaborator

I think it's okay to keep this error if something weird happens (i.e. we expand the enum but forget to update this).

from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr


optuna.logging.set_verbosity(optuna.logging.WARNING)
Collaborator

Maybe just add a comment here why you're setting this?

def __init__(
self,
input_features: pd.DataFrame,
label: np.ndarray,
Collaborator

super minor, maybe call this labels instead of label?

("preprocessing", preprocessing),
(
"xgboost",
xgb.XGBClassifier(
Collaborator

I think this comment remains relevant? We have a use_gpu argument, but it isn't making its way into the class?


def _evaluate_pipeline_cv(self, trial: Trial, num_kfolds: int) -> float:
"""
Performs cross-validation on the pipeline and returns the mean TPR at a fixed FPR.
Collaborator

Currently, this is doing it at a specific FPR threshold (0.1) and it isn't configurable. So I think the docs need to mention this here, and in the return value.
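One way to make the threshold explicit (a sketch; the fold-score plumbing and names are assumptions, not the toolkit's API):

```python
import numpy as np

DEFAULT_FPR_THRESHOLD = 0.1  # currently hard-coded in the pipeline


def mean_tpr_at_fpr(fold_labels, fold_scores, fpr_threshold=DEFAULT_FPR_THRESHOLD):
    """Mean TPR across CV folds, each measured at `fpr_threshold`.

    Making the threshold a parameter lets the docstring (and the return
    value's description) state exactly which FPR the TPR corresponds to.
    """
    tprs = []
    for labels, scores in zip(fold_labels, fold_scores):
        order = np.argsort(-scores)          # descending attack score
        ordered = labels[order]
        tpr = np.cumsum(ordered) / max(ordered.sum(), 1)
        fpr = np.cumsum(1 - ordered) / max((1 - ordered).sum(), 1)
        ok = fpr <= fpr_threshold
        tprs.append(float(tpr[ok].max()) if ok.any() else 0.0)
    return float(np.mean(tprs))
```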

Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

# Synthetic data borrowed from the attack implementation repository.
Collaborator

I'd add a link to where this was borrowed from.

Collaborator

@fatemetkl fatemetkl left a comment

Thank you for addressing the comments. Most of my comments are addressed, except for a few minor ones about the attack example config, the DOMIAS documentation, and the extra line in .gitignore, I believe. Happy to discuss 🙂.

@sarakodeiri sarakodeiri requested a review from emersodb October 3, 2025 18:23
Collaborator

@fatemetkl fatemetkl left a comment

Looks great to me! 🚀

@sarakodeiri sarakodeiri merged commit d8fc981 into main Oct 6, 2025
5 of 6 checks passed
@sarakodeiri sarakodeiri deleted the sk/meta_classifier branch October 6, 2025 16:06
bzamanlooy pushed a commit that referenced this pull request Oct 10, 2025
c83ecea Trainer: Refactoring process_pipeline_data function (#54)
2be0f81 Bump astral-sh/setup-uv from 6.7.0 to 6.8.0 (#56)
004fc67 Trainer: general variable renamings on data_loaders.py and adding a couple more enums (#53)
4c2d75f Trainer: Refactoring the _pair_clustering_keep_id (#52)
d8fc981 Ensemble attack: Meta classifier pipeline  (#37)

git-subtree-dir: deps/midst-toolkit
git-subtree-split: c83ecea
